Abstract

A long-standing mystery in using Maximum Entropy (MaxEnt) is how to deal with constraints
whose values are uncertain. This situation arises when constraint values are estimated from data, 
because of finite sample sizes. One approach to this problem, advocated by E.T. Jaynes [1], is to 
ignore this uncertainty, and treat the empirically observed values as exact. We refer to this as the 
classic MaxEnt approach. Classic MaxEnt gives point probabilities (subject to the given constraints), 
rather than probability densities. We develop an alternative approach that assumes that the uncertain 
constraint values are represented by a probability density (e.g., a Gaussian), and this uncertainty
yields a MaxEnt posterior probability density. That is, the classic MaxEnt point probabilities are 
regarded as a multidimensional function of the given constraint values, and uncertainty on these 
values is transmitted through the MaxEnt function to give uncertainty over the MaxEnt probabilities. 
We illustrate this approach by explicitly calculating the generalized MaxEnt density for a simple 
but common case, then show how this can be extended numerically to the general case. This paper 
expands the generalized MaxEnt concept introduced in a previous paper [3]. 

INTRODUCTION 

A mystery in using Maximum Entropy (MaxEnt) inference in practice is: "Where do
the constraints come from?" The normalization constraint (1) comes from the logical
requirement that the sum of all probabilities must equal 1, since some event i out of a set 
of events must occur. 

\[ \forall i:\quad 0 \le P_i \le 1, \qquad \sum_i P_i = 1. \tag{1} \]

However, other constraints, such as the mean value constraint discussed by Jaynes in
[2], are just asserted, without specifying where they come from and how their constraint
values are found. Two possible sources of constraints are: 

1. By Definition: Such as the normalization constraint, or constraints derived from 
logical requirements, physical laws, or by assumption. 

2. By Measurement: The value of these constraints is inherently uncertain.

In the first type of constraint, uncertainty associated with the constraint value does 
not arise. However, for constraints based on measurements, the observed value can only 
be known with a precision dictated by the sample size. The Classic MaxEnt (CME) 
approach has no obvious mechanism for taking this uncertainty into account. Jaynes 
was well aware of this problem, and attempted to address it in his paper [1], at the end 
of section B, where he offered three approaches. The first is to ignore it: just use the
empirically determined constraint values as if they are exact. The second solution is to
generalize the classic MaxEnt approach to accommodate the constraint uncertainty. This is the
approach we take in this paper, although Jaynes dismisses it as ad hoc. The
third approach is to introduce constraint uncertainty by adding extra variance constraints. 
We show that the first and last approaches are untenable, while the second approach is 
dictated by the laws of probability. 


THE CLASSIC MAXENT SOLUTION (CME) 


The principle of Maximum Entropy (MaxEnt) is a method for using constraint infor- 
mation to find a set of point probability values, P, that assumes the least (Shannon) 
information consistent with the given constraints. When the given (linear) constraints 
are insufficient to uniquely constrain P to particular point values, MaxEnt picks out the 
unique distribution that satisfies all the constraints and also maximizes the entropy. 

In the case of a finite set of $I$ mutually exclusive and exhaustive events, described by
discrete probabilities $P_i$, the entropy is defined as:

\[ H(P) = -\sum_{i=1}^{I} P_i \log P_i. \tag{2} \]

Given a set of J independent linear constraints, including the normalization (1), each of 
the form: 

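\[ A_j \cdot P = C_j, \tag{3} \]

with $J < I$, the maximum entropy distribution may be found by the following procedure
[2]: define the partition function:

\[ Z(\lambda) = \sum_{i=1}^{I} \exp\Bigl(-\sum_{j=1}^{J} \lambda_j A_{ji}\Bigr), \tag{4} \]

with the Lagrange multipliers $\lambda$ determined by the set of $J$ simultaneous equations:

\[ \frac{\partial}{\partial \lambda_j} \log Z(\lambda) + C_j = 0. \tag{5} \]

Then

\[ H_{\max} = \log Z(\lambda) + \lambda \cdot C, \tag{6} \]

and the corresponding probability distribution is:

\[ P_i = Z(\lambda)^{-1} \exp\Bigl(-\sum_{j=1}^{J} \lambda_j A_{ji}\Bigr) = Z(\lambda)^{-1} \prod_{j=1}^{J} \exp(-\lambda_j A_{ji}). \tag{7} \]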

Explicit solutions for the dice case are given below. These CME values are different
from the Maximum A Posteriori (MAP), Maximum Likelihood (ML) and Posterior Mean
probability estimators, reflecting the different assumptions built into each estimator (see
[3] for a discussion of these assumptions).
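
As a minimal numerical sketch of the procedure in (4)-(7), the Python code below handles the special case of a single mean-value constraint on a set of face values, with the normalization absorbed into the partition function. The function name `maxent_mean`, the bisection bracket of plus or minus 50 and the tolerance are illustrative assumptions, not part of the original analysis.

```python
import math

def maxent_mean(values, mean, tol=1e-12):
    """MaxEnt probabilities over `values` subject to a fixed mean.

    P_i is proportional to exp(-lam * values[i]); the single Lagrange
    multiplier lam is found by bisection (an assumed, simple root finder).
    """
    if not min(values) < mean < max(values):
        raise ValueError("mean must lie strictly inside the range of values")

    def probs(lam):
        # Shift the exponents before exponentiating, for numerical stability.
        exps = [-lam * v for v in values]
        m = max(exps)
        w = [math.exp(e - m) for e in exps]
        z = sum(w)  # partition function, up to the common shift factor
        return [wi / z for wi in w]

    def mean_of(lam):
        return sum(p * v for p, v in zip(probs(lam), values))

    # The mean is strictly decreasing in lam, so bisection is safe.
    lo, hi = -50.0, 50.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_of(mid) > mean:
            lo = mid  # mean still too large: lam must increase
        else:
            hi = mid
    return probs(0.5 * (lo + hi))
```

For example, `maxent_mean([1, 2, 3], 2.0)` returns three values that are each close to 1/3, the uniform distribution expected when the mean constraint carries no extra information.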
GENERALIZED MAXENT


Equations (2)-(7) show that the maximum entropy point probabilities $P_i$ are a function
of the constraint values $C_j$. If these $C_j$'s are estimated from a sample, then Bayes' theorem implies
that our knowledge of their values is approximate and can be expressed as pdfs (e.g.,
Gaussians), whose width decreases with increasing sample size. Using the Jacobian
determinant of the function relating the MaxEnt $P_i$'s to the $C_j$'s, the joint posterior pdf on
the $C_j$'s can in principle be transformed into a joint pdf on the $P_i$'s through the maximum
entropy function. We illustrate this process by a simple example using a three-faced
dice with an experimentally determined mean value. For the three-faced dice, there
are three unknown probabilities $P = (P_1, P_2, P_3)$, one for each of the three faces. These
probabilities must satisfy the following linear constraints (i.e., the normalization and
mean value constraints):

\[ P_1 + P_2 + P_3 = 1, \qquad \sum_{i=1}^{3} i P_i = P_1 + 2P_2 + 3P_3 = \mu, \tag{8} \]

where the value $\mu$ is only known approximately from data. Since all face values are either
1, 2 or 3, $\mu$ must be in the range 1 to 3. Here, the set of linear constraint equations (3)
reduces to equations (8). In this simple case, there is only one degree of freedom left, and
this is removed when the additional MaxEnt constraint (2) is also imposed. Using (7),
the resulting $P_i$'s are:

\[ P_i = \exp(-\kappa - i\lambda), \tag{9} \]

where $\kappa$ is fixed by the normalization constraint and $\lambda$ is fixed by the mean value
constraint. Let $x = \exp(-\lambda)$ and $z = \exp(-\kappa)$. Substituting into (8), we have (after
eliminating $z$):

\[ x^2(\mu - 3) + x(\mu - 2) + \mu - 1 = 0, \tag{10} \]

whose only positive solution is: 


\[ x(\mu) = \frac{\mu - 2 + \sqrt{-3\mu^2 + 12\mu - 8}}{2(3 - \mu)}. \tag{11} \]


The normalizing constant (or partition function) $z(\mu)$ is found by substituting $x(\mu)$ from
(11) into the normalizing equation to give:

\[ z(\mu) = \frac{2(3-\mu)^3}{-3\mu^2 - \mu\bigl(\sqrt{-3\mu^2 + 12\mu - 8} - 14\bigr) + 4\sqrt{-3\mu^2 + 12\mu - 8} - 14}. \tag{12} \]

Putting these equations together gives the desired MaxEnt probabilities:

\[ P_i(\mu) = z(\mu)\, x(\mu)^i, \tag{13} \]

which in turn reduce to:

\[
\begin{aligned}
P_1(\mu) &= \frac{2(3-\mu)^2}{10 - 3\mu + \sqrt{-3\mu^2 + 12\mu - 8}}, \\
P_2(\mu) &= \frac{(3-\mu)\bigl(\mu - 2 + \sqrt{-3\mu^2 + 12\mu - 8}\bigr)}{10 - 3\mu + \sqrt{-3\mu^2 + 12\mu - 8}}, \\
P_3(\mu) &= \frac{\bigl(\mu - 2 + \sqrt{-3\mu^2 + 12\mu - 8}\bigr)^2}{2\bigl(10 - 3\mu + \sqrt{-3\mu^2 + 12\mu - 8}\bigr)}.
\end{aligned}
\tag{14}
\]

FIGURE 1. The maxent probabilities for each of the 3 faces as a function of $\mu$.



This is the MaxEnt solution to the three-faced dice problem as a function of $\mu$, and the
resulting probabilities are shown in Fig. (1). Note that for $\mu = 2$, $P_i = 1/3$ for all $i$, as
expected, since $\mu = 2$ is the mean value for a "fair" dice, where all three faces are equally
probable.
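
The closed-form solution can be checked numerically. The sketch below evaluates the three probabilities of (14) and lets one confirm that they satisfy the normalization and mean constraints (8); the function name `die_probs` is an illustrative choice, and the factor of 2 in the denominator of the third probability is the one required for the probabilities to sum to 1.

```python
import math

def die_probs(mu):
    """Closed-form MaxEnt probabilities (P1, P2, P3) of eq. (14), 1 <= mu <= 3."""
    if not 1.0 <= mu <= 3.0:
        raise ValueError("mu must lie in the range 1 to 3")
    s = math.sqrt(-3.0 * mu**2 + 12.0 * mu - 8.0)  # square root shared by all terms
    denom = 10.0 - 3.0 * mu + s
    p1 = 2.0 * (3.0 - mu) ** 2 / denom
    p2 = (3.0 - mu) * (mu - 2.0 + s) / denom
    p3 = (mu - 2.0 + s) ** 2 / (2.0 * denom)
    return p1, p2, p3

if __name__ == "__main__":
    for mu in (1.0, 1.5, 2.0, 2.5, 3.0):
        p = die_probs(mu)
        mean = p[0] + 2.0 * p[1] + 3.0 * p[2]
        print(mu, p, sum(p), mean)
```

At the end points the solution degenerates as one would expect: a mean of 1 puts all probability on face 1, and a mean of 3 puts it all on face 3.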

The MaxEnt probabilities (14) can now be substituted into the entropy formula (2) to
give the three-faced dice entropy as a function of $\mu$, as shown in Fig. (2). Note that the
entropy has a maximum at $\mu = 2$, as expected, since this is the equiprobable entropy.
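
The entropy curve can be reproduced with a few lines of Python. This sketch builds the probabilities from the root $x(\mu)$ of (11), using $P_i \propto x^{i}$ as in (13), and evaluates (2) with natural logarithms. The function names and the choice of logarithm base are assumptions made for illustration.

```python
import math

def x_of_mu(mu):
    """Positive root of eq. (10), written as in eq. (11); needs 1 <= mu < 3."""
    if not 1.0 <= mu < 3.0:
        raise ValueError("mu must satisfy 1 <= mu < 3")
    s = math.sqrt(-3.0 * mu**2 + 12.0 * mu - 8.0)
    return (mu - 2.0 + s) / (2.0 * (3.0 - mu))

def entropy_of_mu(mu):
    """Shannon entropy (natural log) of the MaxEnt three-faced dice with mean mu."""
    x = x_of_mu(mu)
    # P_i = z * x**i; dividing out one factor of x avoids 0/0 at mu = 1.
    weights = [1.0, x, x * x]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Terms with zero probability contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

Evaluating `entropy_of_mu` on a grid of means between 1 and 3 gives a curve that peaks at a mean of 2 with value log 3 and is symmetric about that point.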

Having found the MaxEnt point probabilities $P$ as a function of $\mu$ for the three-faced dice
problem, we now examine what happens if we do not know the value of $\mu$ exactly,
but instead our knowledge is summarized as a pdf over $\mu$, i.e. $f_\mu(\mu)$. From standard
probability theory, a pdf on one set of variables can be transformed into a pdf on an
equal number of variables that are related through a set of equations. That is:

\[ f_Y(y_1, y_2, \ldots, y_n) = |D|\, f_X(x_1, x_2, \ldots, x_n), \tag{15} \]
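where $|D|$ is the Jacobian determinant of the transformation of the variables $x_1, x_2, \ldots, x_n$
to the variables $y_1, y_2, \ldots, y_n$, i.e.

\[
D = \frac{\partial(x_1, x_2, \ldots, x_n)}{\partial(y_1, y_2, \ldots, y_n)} =
\begin{vmatrix}
\dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_1}{\partial y_2} & \cdots & \dfrac{\partial x_1}{\partial y_n} \\
\dfrac{\partial x_2}{\partial y_1} & \dfrac{\partial x_2}{\partial y_2} & \cdots & \dfrac{\partial x_2}{\partial y_n} \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial x_n}{\partial y_1} & \dfrac{\partial x_n}{\partial y_2} & \cdots & \dfrac{\partial x_n}{\partial y_n}
\end{vmatrix}.
\tag{16}
\]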


For the three-faced dice case, this is a one-dimensional transformation for each $P_i$. For
example, for the probability function $P_2(\mu)$, $|D_2|$ is derived from (14) giving:

\[ |D_2| = \left|\frac{\partial \mu}{\partial P_2}\right| = \left|\frac{\sqrt{-3\mu^2 + 12\mu - 8}}{2 - \mu}\right|, \tag{17} \]

and similarly for $P_1$ and $P_3$. Note that (17) has a singularity at $\mu = 2$, and that the absolute
value must be taken so that $|D_2|$ is always positive. Combining (15) and (17), we get
finally:

\[ f_{P_2}(P_2) = |D_2|\, f_\mu(\mu) = \left|\frac{\sqrt{-3\mu^2 + 12\mu - 8}}{2 - \mu}\right| f_\mu(\mu), \tag{18} \]

which has the desired effect of mapping the uncertainty about $\mu$ onto uncertainty about
$P_2$. The singularity in (17) occurs at the maximum value for $P_2 = 1/3$, as can be seen in
Fig. 1, but the resulting pdf $f_{P_2}(P_2)$ is still normalized, despite the infinite value at this
boundary. In the extreme case where $f_\mu(\mu)$ becomes a delta function, the corresponding
$f(P_i)$'s also become delta functions, and give the CME result.
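
The change of variables for $P_2$ can be sketched numerically as follows, assuming a Gaussian $f_\mu$ with an illustrative mean of 2.5 and standard deviation of 0.05. Two points here are our own additions rather than statements from the text. First, the code uses the rearrangement $P_2 = (\sqrt{-3\mu^2 + 12\mu - 8} - 1)/3$, which follows from (8) together with $P_2^2 = P_1 P_3$, so that $|\mu - 2| = \sqrt{1 - 2P_2 - 3P_2^2}$. Second, because $P_2(\mu)$ is symmetric about $\mu = 2$, two values of $\mu$ map to each $P_2$, and the sketch sums the contributions of both. All function names are hypothetical.

```python
import math

def gaussian_pdf(mu, mean, sigma):
    """Gaussian density, standing in for the (assumed) pdf on mu."""
    return math.exp(-0.5 * ((mu - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def p2_density(p2, mean, sigma):
    """pdf of P2 induced by a Gaussian pdf on mu, for 0 <= p2 < 1/3."""
    r = math.sqrt(1.0 - 2.0 * p2 - 3.0 * p2**2)  # r = |mu - 2| at this P2
    jac = (3.0 * p2 + 1.0) / r                   # |d mu / d P2|, as in eq. (17)
    # Both roots mu = 2 + r and mu = 2 - r map to the same P2.
    return jac * (gaussian_pdf(2.0 + r, mean, sigma) + gaussian_pdf(2.0 - r, mean, sigma))

def integrate_midpoint(f, a, b, n):
    """Midpoint rule; it never evaluates f at the singular end point b = 1/3."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

if __name__ == "__main__":
    total = integrate_midpoint(lambda p: p2_density(p, 2.5, 0.05), 0.0, 1.0 / 3.0, 20000)
    print("integral of the P2 density:", total)
```

With these illustrative parameters the Gaussian mass lies far from the singular point, so a plain midpoint rule is enough to check the normalization; a pdf on $\mu$ centred near 2 would need a quadrature that treats the integrable singularity more carefully.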

For the above simple three-faced dice case, it was possible to do the analysis explicitly.
Higher order cases cannot be done explicitly because they involve analytic roots of
high order polynomials. In such cases, it is relatively easy to approximate the Jacobian
determinant with a Taylor expansion about the estimated constraint values. In the simplest
case, this involves the linear/tangent plane approximation of the MaxEnt probability
functions as the constraint values are numerically varied around their maximum values.
This should be a good approximation provided the MaxEnt probability functions do not
curve significantly over the range of constraint values. In the three-faced dice case, Fig.
(1), the $P_i$'s can be seen to be approximately straight lines for small $\mu$ ranges, showing
that the linear approximation would work well in this case.
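
A minimal sketch of the linear approximation for the three-faced dice is given below. It estimates the slope of each $P_i(\mu)$ by a central finite difference and scales an assumed standard deviation on $\mu$ by that slope. The function names, the step size `h` and the Gaussian first-order propagation rule are illustrative assumptions; in a higher-dimensional problem `die_probs` would be replaced by a numerical MaxEnt solver.

```python
import math

def die_probs(mu):
    """Closed-form MaxEnt probabilities (P1, P2, P3) of eq. (14)."""
    s = math.sqrt(-3.0 * mu**2 + 12.0 * mu - 8.0)
    denom = 10.0 - 3.0 * mu + s
    return (
        2.0 * (3.0 - mu) ** 2 / denom,
        (3.0 - mu) * (mu - 2.0 + s) / denom,
        (mu - 2.0 + s) ** 2 / (2.0 * denom),
    )

def linear_propagation(mu_hat, sigma_mu, h=1e-5):
    """Tangent-line estimate of the spread of each P_i around mu_hat."""
    centre = die_probs(mu_hat)
    up = die_probs(mu_hat + h)
    down = die_probs(mu_hat - h)
    # Central differences approximate dP_i/dmu at the estimated mean.
    slopes = [(u - d) / (2.0 * h) for u, d in zip(up, down)]
    # First-order propagation: sigma_P ~ |dP/dmu| * sigma_mu.
    sigmas = [abs(m) * sigma_mu for m in slopes]
    return centre, slopes, sigmas
```

Calling `linear_propagation(2.5, 0.05)` returns the point probabilities at the estimated mean together with approximate standard deviations for each face. Near an estimated mean of 2 the slope of $P_2$ vanishes, so the first-order estimate for $P_2$ is unreliable there, in line with the singularity noted for (17).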


DISCUSSION 

We call this mapping of uncertainty in the constraint values into uncertainty in the MaxEnt
probabilities (expressed as pdfs) the Generalized principle of Maximum Entropy
(GME). We have not previously seen this generalization in the literature, but it is a direct
result of applying probability theory to the situation. In the limit of sample size
$N \to \infty$, the constraint values are given exactly, so GME becomes CME. Jaynes in [1],
end of section B, briefly mentions an approach that resembles GME, but he calls this
approach ad hoc and does not elaborate. Instead he advocates adding the uncertainty on
constraint values as extra constraints, such as a variance (on the constraint values). In
our three-faced dice problem, this would require asserting a variance on $\mu$. Jaynes
did not develop this alternative approach, but if he had he would have discovered that it
does not work, because the additional constraint(s) are on the wrong space! In CME, the
constraints are on the space of possible probability values $P_i$, not on the values of the
constraints (such as $\mu$), so his extra constraints would not have any direct effect on the
$P_i$'s. Even if it were possible to translate constraints on constraint values into constraints
on the underlying $P_i$'s, the resulting CME probabilities would be point probabilities, not
pdfs, and so would not reflect the underlying uncertainty.

We stress that GME is only a solution to the problem of how to handle uncertain 
constraint values within the maximum entropy inference framework. The resulting pdfs 
avoid overly strong commitment compared with point probabilities. However, if the 
GME pdfs are used for prediction or decision making, it is important to remember 
that they embody other strong assumptions that may be incorrect in particular cases. 
The essential assumption is that the set of constraints used in either GME or CME is 
complete, meaning that no other significant constraints apply. Without good reason to
believe in the constraint set's completeness, there is no good reason for believing the
predictions of GME or CME. See [3] for further discussion on this point.